The latent words language model

Authors

  • Koen Deschacht
  • Jan De Belder
  • Marie-Francine Moens
Abstract

Statistical language models have found many applications in information retrieval since their introduction almost three decades ago. Currently the most popular models are n-gram models, which are known to suffer from serious sparseness issues. This sparseness results from the large vocabulary size |V| of any given corpus and from the exponential nature of n-grams, where potentially |V|^n distinct n-grams can occur in a corpus. Even though many of these n-grams never occur in practice, due to grammatical and semantic restrictions in natural language, we still observe an exponential growth in unique n-grams with increasing n. Smoothing methods combine (specific, but sparse and potentially unreliable) higher-order n-grams with (less specific but more reliable) lower-order n-grams. Goodman (2001) found that interpolated Kneser-Ney smoothing (IKN) performed best in a comparison of different smoothing methods in terms of perplexity on a previously unseen corpus. In this article we describe a novel language model that aims to solve this sparseness problem and in the process learns syntactically and semantically similar words, resulting in an improved language model in terms of perplexity reduction.
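
The smoothing idea described above can be made concrete with a toy example. The sketch below is a minimal, illustrative stand-in: simple fixed-weight (Jelinek-Mercer style) interpolation of bigram and unigram estimates, not the latent words model of this article and not full interpolated Kneser-Ney; the corpus, test string, and weight lambda are all assumed for illustration.

```python
# Minimal sketch of interpolated n-gram smoothing (Jelinek-Mercer style):
# a specific but sparse bigram estimate is mixed with a less specific but
# more reliable unigram estimate. Corpus and lambda are toy assumptions.
import math
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def p_unigram(w):
    return unigrams[w] / total

def p_bigram(prev, w):
    # Maximum-likelihood bigram estimate; zero for unseen bigrams.
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_interp(prev, w, lam=0.7):
    # Linear interpolation: the higher-order estimate is backed off to
    # the lower order. (A real model would also smooth unseen unigrams.)
    return lam * p_bigram(prev, w) + (1 - lam) * p_unigram(w)

def perplexity(tokens, lam=0.7):
    log_prob = sum(math.log(p_interp(p, w, lam))
                   for p, w in zip(tokens, tokens[1:]))
    return math.exp(-log_prob / (len(tokens) - 1))

print(f"perplexity: {perplexity('the cat sat on the rug'.split()):.2f}")
```

A higher lambda trusts the sparse bigram counts more; the perplexity compared by Goodman (2001) across smoothing methods is exactly this exponentiated average negative log-probability on held-out text.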


Similar articles

Fitting long-range information using interpolated distanced n-grams and cache models into a latent Dirichlet language model for speech recognition

We propose a language modeling (LM) approach incorporating interpolated distanced n-grams into a latent Dirichlet language model (LDLM) [1] for speech recognition. The LDLM relaxes the bag-of-words assumption and document topic extraction of latent Dirichlet allocation (LDA). It uses default background n-grams where topic information is extracted from the (n-1) history words through Dirichlet distributi...


Spatial Latent Dirichlet Allocation

In recent years, the language model Latent Dirichlet Allocation (LDA), which clusters co-occurring words into topics, has been widely applied in the computer vision field. However, many of these applications have difficulty with modeling the spatial and temporal structure among visual words, since LDA assumes that a document is a “bag-of-words”. It is also critical to properly design “words” an...


A new model for Persian multi-part words edition based on statistical machine translation

Multi-part words in the English language are hyphenated, with the hyphen used to separate the different parts. The Persian language contains multi-part words as well. Based on Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use the space character instead of the half-space character. This common incorrect use of the space character leads to some s...


Mixture of latent words language models for domain adaptation

This paper introduces a novel language model (LM) adaptation method based on a mixture of latent words language models (LWLMs). LMs are often constructed as mixtures of n-gram models, whose mixture weights are optimized using target domain data. However, n-gram mixture modeling is not flexible enough for domain adaptation because model merging is conducted on the observed word space. Since the words...

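The mixture-weight optimization mentioned in the entry above is commonly done with EM on held-out target-domain data. The following is a generic, hedged sketch of that recipe, not code from the cited paper; the component models `general` and `indomain` and the held-out events are hypothetical stand-ins.

```python
def em_mixture_weights(heldout, components, iters=20):
    """Tune linear-mixture weights for conditional LMs via EM.
    heldout: list of (history, word); components: list of p(word | history)."""
    k = len(components)
    weights = [1.0 / k] * k  # start from a uniform mixture
    for _ in range(iters):
        resp = [0.0] * k
        for h, w in heldout:
            # E-step: responsibility of each component for this event
            scores = [weights[j] * components[j](h, w) for j in range(k)]
            z = sum(scores) or 1e-12
            for j in range(k):
                resp[j] += scores[j] / z
        # M-step: new weights are the normalized total responsibilities
        weights = [r / sum(resp) for r in resp]
    return weights

# Hypothetical toy components and held-out data (assumed, for illustration).
general = lambda h, w: 0.2 if w == "the" else 0.01
indomain = lambda h, w: 0.5 if w == "larynx" else 0.01
heldout = [("", "larynx"), ("", "the"), ("", "larynx")]
print(em_mixture_weights(heldout, [general, indomain]))
```

The cited paper argues that this kind of observed-word-space merging is not flexible enough for domain adaptation, which motivates its mixtures of LWLMs instead.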

A Latent Variable Recurrent Neural Network for Discourse Relation Language Models

This paper presents a novel latent variable recurrent neural network architecture for jointly modeling sequences of words and (possibly latent) discourse relations that link adjacent sentences. A recurrent neural network generates individual words, thus reaping the benefits of discriminatively-trained vector representations. The discourse relations are represented with a latent variable, which ...


Interpolated Dirichlet Class Language Model for Speech Recognition Incorporating Long-distance N-grams

We propose a language modeling (LM) approach incorporating interpolated distanced n-grams in a Dirichlet class language model (DCLM) (Chien and Chueh, 2011) for speech recognition. The DCLM relaxes the bag-of-words assumption and document topic extraction of latent Dirichlet allocation (LDA). The latent variable of DCLM reflects the class information of an n-gram event rather than the topic in...




Journal:
  • Computer Speech & Language

Volume: 26, Issue: –

Pages: –

Publication date: 2012